
BMC Medical Genomics

Springer Science and Business Media LLC

All preprints, ranked by how well they match BMC Medical Genomics's content profile, based on 12 papers previously published here. The average preprint has a 0.05% match score for this journal, so anything above that is already an above-average fit. Older preprints may already have been published elsewhere.

1
Unlocking Esophageal Carcinoma's Secrets: An integrated Omics Approach Unveils DNA Methylation as a pivotal Early Detection Biomarker with Clinical Implications.

Akbar, A.; Zhang, L.; Liu, H.-S.

2023-09-28 health informatics 10.1101/2023.09.26.23296198
Top 0.1%
157× avg

Esophageal carcinoma (EC) ranks among the top six most prevalent malignancies worldwide with a recent surge in incidence. An innovative integrated omics technique is presented for discerning the two primary types of EC: squamous cell carcinoma and adenocarcinoma. Utilizing The Cancer Genome Atlas (TCGA) data via Bioconductor, the research integrated DNA methylation and RNA expression analyses for esophageal cancer (ESCA). Key findings revealed DNA methylation's pivotal role in ESCA progression and its potential as an early detection biomarker. Significant disparities in methylation patterns offered insights into the disease's pathogenesis. A comparison with the TCGA Pan-Cancer dataset using Bioconductor tools enriched the understanding of ESCA genomics. Specifically, 131,220 hypomethylated probes were detected in tumors compared to 6,248 in healthy tissues. Additionally, 42,060 probe-gene pairs linked methylation variations to expression alterations, with 768 hypomethylated motifs identified. Thirteen of these motifs emerged as potential diagnostic markers. Transcription factor analyses spotlighted crucial regulators, including NFL3, ATF4, JUN, and CEBPG, revealing intricate regulatory networks in ESCA. Survival statistics further correlated clinical factors with patient longevity. This research recommends an innovative approach to identifying esophageal abnormalities through DNA methylation and gene expression mechanisms. The results suggest DNA methylation may serve as an early detection biomarker, aiding in identifying esophageal cancer prior to more advanced stages.

2
Integrative Multi-Omics Analysis Reveals Novel Molecular Signatures, Disease Stratification and Therapeutic Opportunities in Primary Ciliary Dyskinesia: First AI-ML empowered platform towards precision medicine targeting human ciliopathies

Jitender; Hossain, M. W.; Mohanty, S.; Kateriya, S.

2026-01-14 health informatics 10.64898/2026.01.12.26343910
Top 0.1%
150× avg

Primary ciliary dyskinesia (PCD) belongs to the group of rare genetic disorders that are extremely hard to diagnose and treat. Current diagnostic modalities detect only 70% of cases and are technically demanding. This necessitates novel computational approaches for biomarker discovery and the identification of therapeutic targets. We have developed an integrative computational pipeline analysing transcriptomic data from 6 PCD patients and 9 healthy controls. We identified 1,249 differentially expressed genes (false discovery rate below 0.05, absolute log2 fold-change exceeding 1), revealing oxidative stress as a central pathophysiological mechanism, with glutathione S-transferase theta 2B (GSTT2B) emerging as a master regulatory hub. WGCNA detected 12 co-expression modules with three significantly disease-associated modules. The application of machine learning enabled outstanding diagnostic performance with a minimal 10-gene signature, maintaining an accuracy of 0.93. The Random Forest area under the receiver operating characteristic curve was estimated to be 0.96 ± 0.03. This study aided in analyzing uncharacterized genes, such as FRMPD3, C1orf194, and METTL26, which were not previously associated with PCD. The methodology adopted for drug repurposing helped in the identification of FDA-approved drugs, including N-acetylcysteine, metformin, and resveratrol. They appeared as top candidates for therapeutic intervention in PCD. The age-dependent classification revealed that 156 genes exhibited significant disease progression interactions. The gender-associated classification identified 342 sex-specific responsive genes. Background: Primary ciliary dyskinesia (PCD) is considered a rare genetic disorder that arises due to ciliary dysfunction. It causes severe respiratory illness including chronic infections, bronchiectasis, and morbidity. 
Although more than 50 PCD genes have been identified, the molecular mechanisms underlying PCD pathophysiology remain unclear. This obscurity leads to failed therapeutic interventions, highlighting the need for robust PCD-specific molecular characterization. Methods: This study incorporated an integrated computational analysis of transcriptomic data obtained from the GSE25186 dataset. This dataset encompasses nasal epithelial cell samples from six confirmed cases of PCD and nine healthy controls. The approaches used included empirical Bayes moderated t statistics, weighted gene co-expression network analysis (WGCNA) with soft threshold β = 6, comprehensive pathway enrichment across KEGG, Reactome, and GO databases, machine learning classification using Random Forest and Support Vector Machines, temporal trajectory inference through pseudotime analysis, and systematic drug repurposing screening against DrugBank v5.1.8 and ChEMBL v29 databases. Results: We identified 1,249 differentially expressed genes (adjusted p-value < 0.05, |log2FC| > 1), comprising 533 upregulated and 716 downregulated genes. WGCNA identified 12 co-expression modules, three of which were significantly associated with disease (brown module: r = 0.78, p = 2x10-; blue module: r = -0.65, p = 0.008; green module: r = 0.82, p = 0.001). The machine learning tools yielded outstanding diagnostic performance, with a Random Forest AUC value of 0.96 ± 0.03. This led to the generation of a minimal 10-gene diagnostic signature. This study identified N-acetylcysteine (NAC) as the top therapeutic candidate, with enhanced potential for treating PCD. The other candidates, metformin and resveratrol, had composite scores of 1.85 and 0.28, respectively, whereas NAC had a composite score of 2.46. Systems biology-based classification by age revealed progressive molecular deterioration. 
A total of 156 genes had a significant age x disease interaction, with a false discovery rate of less than 0.05. Gender stratification located 342 genes that were differentially responsive, leading to the design of male/female-dependent therapeutic interventions. Conclusions: The multi-omics analysis offers significant insights into PCD molecular pathophysiology. The oxidative stress mechanism (GSTT2B, GPX1, SOD2) and protein homeostasis disruption (HSPA8, PDIA3, CALR) served as central regulators of disease progression. This study offers novel insights into reliable diagnostic markers and FDA-approved, readily available drug candidates for therapeutic intervention in PCD. Further, age- and gender-associated classification of biological markers in PCD offers a novel path toward tailored medicine. This study established a robust molecular framework for therapeutics of rare genetic diseases.
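The differential-expression thresholds quoted in this abstract (adjusted p-value < 0.05, |log2FC| > 1) can be illustrated with a minimal sketch. This is not the authors' pipeline: the `GeneResult` type, the `split_degs` helper, and the toy gene names and values are all hypothetical, and only the two cutoffs come from the abstract.

```python
from dataclasses import dataclass


@dataclass
class GeneResult:
    gene: str       # gene symbol (toy names below are made up)
    log2_fc: float  # log2 fold-change, patients vs. controls
    adj_p: float    # multiple-testing-adjusted p-value


def split_degs(results, p_cutoff=0.05, lfc_cutoff=1.0):
    """Return (upregulated, downregulated) gene symbols passing both cutoffs."""
    up, down = [], []
    for r in results:
        if r.adj_p < p_cutoff and abs(r.log2_fc) > lfc_cutoff:
            (up if r.log2_fc > 0 else down).append(r.gene)
    return up, down


# Toy input, for illustration only; not data from the preprint.
toy = [
    GeneResult("GENE_A", 2.3, 0.001),   # passes, upregulated
    GeneResult("GENE_B", -1.7, 0.020),  # passes, downregulated
    GeneResult("GENE_C", 0.4, 0.001),   # fold-change too small
    GeneResult("GENE_D", 3.1, 0.200),   # not significant after adjustment
]
up, down = split_degs(toy)
print(up, down)
```

Applying both cutoffs together is what splits a gene list into upregulated and downregulated sets, as in the 533/716 split the abstract reports.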

3
m6A RNA methylation regulators contribute to progression and impact the prognosis of breast cancer

Wenjie, J.; Minglong, D.; Zebin, H.; Kaidi, W.; Han, W.

2020-10-16 health informatics 10.1101/2020.10.13.20212332
Top 0.1%
149× avg

N6-methyladenosine (m6A) is the most common modification of mRNA. Many studies have shown that m6A RNA methylation regulators are clearly expressed in some cancers and are indirectly involved in cancer growth. However, the role of m6A RNA methylation regulators in the prognosis of breast cancer (BRCA) remains unclear. This study used mRNA expression data and the corresponding clinical information obtained from The Cancer Genome Atlas (TCGA) database. We used the Wilcoxon rank-sum test to evaluate the difference in the expression of m6A RNA methylation regulators between the normal group and the tumor group, and analyzed the correlation between m6A RNA methylation regulators. We identified two subgroups of BRCA (cluster1 and 2) by using the K-means algorithm and analyzed the correlation between clinical information and subgroups. The LASSO regression model was then used to select three m6A RNA methylation regulators, namely YTHDF3, ZC3H13, and HNRNPC. The riskScore of each patient was calculated according to the regression coefficients of the three m6A RNA methylation regulators. Based on the riskScore, we divided the patients into two groups, the high-risk group and the low-risk group. We found that the overall survival rate (OS) of the low-risk group was higher than that of the high-risk group. We conducted univariate and multivariate independent prognostic analyses of riskScore and the three m6A RNA methylation regulators, and found that riskScore has a significant correlation with BRCA. In conclusion, m6A RNA methylation regulators are closely related to the development of BRCA, and the prognostic factor riskScore, obtained from regression on the expression of the three m6A RNA methylation regulators, is likely to be a useful prognostic biomarker for guiding individualized treatment of BRCA patients.
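The riskScore described in this abstract is a coefficient-weighted sum of regulator expression, which can be sketched as follows. The abstract gives neither the coefficient values nor the cutoff, so the numbers in `COEFS`, the median split, and the patient data are assumptions made for illustration only.

```python
from statistics import median

# Hypothetical LASSO coefficients; the abstract does not report the real values.
COEFS = {"YTHDF3": 0.10, "ZC3H13": -0.20, "HNRNPC": 0.30}


def risk_score(expression):
    """Weighted sum of expression values over the three regulators."""
    return sum(COEFS[gene] * expression[gene] for gene in COEFS)


def stratify(patients):
    """Label each patient 'high' or 'low' risk using a median cutoff (an assumption)."""
    scores = {pid: risk_score(expr) for pid, expr in patients.items()}
    cutoff = median(scores.values())
    return {pid: ("high" if s > cutoff else "low") for pid, s in scores.items()}


# Toy expression values, for illustration only.
patients = {
    "P1": {"YTHDF3": 2.0, "ZC3H13": 1.0, "HNRNPC": 3.0},
    "P2": {"YTHDF3": 1.0, "ZC3H13": 3.0, "HNRNPC": 1.0},
    "P3": {"YTHDF3": 3.0, "ZC3H13": 2.0, "HNRNPC": 2.0},
    "P4": {"YTHDF3": 0.0, "ZC3H13": 0.0, "HNRNPC": 1.0},
}
groups = stratify(patients)
print(groups)
```

A median split is one common way to form two equal-sized risk groups; other cutoffs would change group membership but not the scoring step.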

4
Hot Spring Residency and Disease Association: a Crossover Gene-Environment Interaction (GxE) Study in Taiwan

Wu, H.-Y.; Chang, K.-J.; Chiu, W.; Wang, C.-Y.; Hsu, Y.-T.; Wen, Y.-C.; Chiang, P.-H.; Chen, Y.-H.; Dai, H.-J.; Lu, C.-H.; Chen, Y.-C.; Tsai, H.-Y.; Chen, Y.-C.; Hsu, C.-H.; Hsieh, A.-R.; Chiou, S.-H.; Yang, Y.-P.; Hsu, C.-C.

2024-07-30 health informatics 10.1101/2024.07.29.24311167
Top 0.1%
147× avg

Background: The advent of genetic biobanking has powered gene-environment interaction (GxE) studies in various disease contexts. Therefore, we aimed to discover novel GxE effects that address hot spring residency as a risk to inconspicuous disease association. Methods: A complete genetic and demographic registry comprising 129,451 individuals was obtained from Taiwan Biobank (TWB). Geographical disease prevalence was analyzed to identify putative disease association with hot-spring residency; multivariable regression and logistic regression were rechecked to exclude socioeconomic confounders in geographical-disease association. Genome-wide association study (GWAS), gene ontology (GO), and protein-protein interaction (PPI) analysis identified predisposing genetic factors among hot-spring-associated diseases. Lastly, a polygenic risk score (PRS) model was formulated to stratify environmental susceptibility according to genetic predisposition. Results: After socioeconomic covariate adjustment, prevalence of dry eye disease (DED) and valvular heart disease (VHD) was significantly associated with hot spring distribution. Through single nucleotide polymorphism (SNP) discovery and subsequent PPI pathway aggregation, CDKL2 and BMPR2 kinase pathways were significantly enriched in hot-spring-specific DED and VHD functional SNPs. Notably, PRS predicted disease well in hot spring regions (PRS for DED: AUC=0.9168; PRS for VHD: AUC=0.8163). Hot spring and discovered SNPs contributed to crossover GxE effect on both DED (relative risk (RR)G+E-=0.99; RRG+E+=0.35; RRG+E+=2.04) and VHD (RRG+E-=0.99; RRG+E+=0.49; RRG+E+=2.01). Conclusion: We identified hot-spring exposure as a modifiable risk in the PRS predicted GxE context of DED and VHD.

5
Dynamic, stage-course protein interaction network using high power CpG sites in Head and Neck Squamous Cell Carcinoma

Riaz, A.; Shah, M.; Zaheer, S.; Salam, A.; Khan, F. F.

2021-07-05 health informatics 10.1101/2021.06.30.21259548
Top 0.1%
126× avg

Head and neck cancer is the sixth leading cause of cancer across the globe and is significantly more prevalent in South Asian countries, including Pakistan. Prediction of pathological stages of cancer can play a pivotal role in early diagnosis and personalized medicine. This project ventures into the prediction of different stages of head and neck squamous cell carcinoma (HNSCC) using prioritized DNA methylation patterns. DNA methylation profiles for each HNSCC stage (stage-I-IV) were used to extensively analyze 485,577 methylation CpG sites and prioritize them on the basis of the highest predictive power using a wrapper-based feature selection method, along with different classification models. We identified 68 high-power methylation sites which predicted the pathological stage of HNSCC samples with 90.62 % accuracy using a Random Forest classifier. We set out to construct a protein-protein interaction network for the proteins encoded by the 67 genes associated with these sites to study its network topology and also undertook enrichment analysis of nodes in their immediate neighborhood for GO and KEGG Pathway annotations which revealed their role in cancer-related pathways, cell differentiation, signal transduction, metabolic and biosynthetic processes. With information on the predictive power of each of the 67 genes in each HNSCC stage, we unveil a dynamic stage-course network for HNSCC. We also intend to further study these genes in light of functional datasets from CRISPR, RNAi, drug screens for their putative role in HNSCC initiation and progression.

6
Development of molecular subtype specific prognostic marker signature in immune response associated colon cancer through fuzzy based transcriptomic approach

Kochan, N.; Dayanc, B. E.

2023-05-24 health informatics 10.1101/2023.05.18.23290045
Top 0.1%
120× avg

Objective: The molecular heterogeneity of colon cancer makes the prediction of disease prognosis challenging. Molecular tumor subtyping offers a way to resolve this heterogeneity. These approaches are expected to contribute to clinical decision-making. In this study, we aimed to identify Consensus Molecular Subtype (CMS) specific prognostic genes of colon cancer, focusing on anti-tumor immune-response associated CMS1, through a fuzzy-based machine learning approach. Materials and Methods: We applied Fuzzy C-Means (FCM) clustering to stratify patients into two groups and identified genes that predict significant disease-specific survival difference between groups. We then performed Cox regression analyses to identify the most significant genes associated with disease-specific survival. A subtype-specific risk score formula and a final risk score formula were constructed and used to calculate risk scores to stratify patients into low and high-risk groups within each CMS (1 to 4) or independent of CMS, respectively. Results: We identified CMS-specific genes and an overall 11-gene signature for prognostic risk prediction based on the disease-specific survival of colon cancer patients. The patients in both discovery and test cohorts were stratified into high and low-risk groups using subtype risk scores. The disease-specific survival of these risk groups within each CMS, except CMS3, was significantly different for both discovery and test cohorts. Discussion and Conclusions: We have identified novel prognostic genes with potential immune regulatory roles within the immune-response associated CMS1. The low number of patients in the CMS3 cohort prevented subtype-specific prognostic gene validation. Tumor stage grouping of the validation cohort suggested the best prediction of prognosis in tumor stage III patients. 
In conclusion, the eleven newly identified genes can efficiently predict the prognostic risk of colon cancer patients and classify patients into corresponding risk groups.

7
Genetic crossroads of cardiovascular disease and its comorbidities: Toward holistic therapeutic strategies

Mishra, P. P.; Mishra, B.

2025-06-18 health informatics 10.1101/2025.06.17.25329808
Top 0.1%
102× avg

With increasing life expectancy, the prevalence of cardiovascular disease (CVD) accompanied by comorbidities is rising, presenting a growing challenge for healthcare systems. Understanding shared genetic factors underlying CVD and its comorbidities can help develop more effective prevention and treatment strategies. In this study, we investigated genetic correlations between CVD and common comorbidities using genome-wide association study (GWAS) summary statistics from the FinnGen R12 release. Following standard quality control procedures, we examined 19 disease endpoints using linkage disequilibrium score regression (LDSC) to estimate heritability and pairwise genetic correlations. Disease traits with significant heritability (z-score ≥ 4) and Bonferroni-corrected significant correlations (adjusted p < 0.05) were selected for genomic structural equation modeling (Genomic SEM) to construct a latent genomic factor (LGF), representing shared genetic liability. Out of the 19 diseases, four CVDs (transient ischemic attack, atrial fibrillation, myocardial infarction and heart failure) and seven comorbidities (type 2 diabetes, asthma, obesity, depression, chronic obstructive pulmonary disease, gingivitis and hypertension) showed statistically significant genetic correlations. A multivariate GWAS of the LGF identified 141 novel associated loci across 29 independent SNPs. These loci overlapped with 16 protein-coding genes, including NPC1, TMEM106B, PTPN22, MAP2K5 and MSRA, implicating them in the shared pathogenesis of CVD and its comorbidities. These findings highlight a shared genetic architecture underlying CVD and its comorbidities, revealing cross-disease genetic risk factors that may enhance joint risk prediction, inform precision medicine strategies, and offer insights into common biological mechanisms and potential targets for integrated therapeutic approaches.

8
Decreased retinal vascular complexity is an early biomarker of MI supported by a shared genetic control.

Villaplana Velasco, A.; Engelmann, J.; Rawlik, K.; Canela-Xandri, O.; Tochel, C.; Lona-Durazo, F.; Mookiah, M. R. K.; Doney, A.; Parra, E.; Trucco, E.; Macgillivray, T.; Rannikmae, K.; Tenesa, A.; Pairo-Castineira, E.; Bernabeu, M. O.

2021-12-16 health informatics 10.1101/2021.12.16.21267446
Top 0.1%
94× avg

There is increasing evidence that the complexity of the retinal vasculature (measured as fractal dimension, Df) might offer earlier insights into the progression of coronary artery disease (CAD) before traditional biomarkers can be detected. This association could be partly explained by a common genetic basis; however, the genetic component of Df is poorly understood. We present here a genome-wide association study (GWAS) that aims to elucidate the genetic component of Df and to analyse its relationship with CAD. To this end, we obtained Df from retinal fundus images and genotyping information from ~38,000 white-British participants in the UK Biobank. We discovered 9 loci associated with Df, previously reported in pigmentation, retinal width and tortuosity, hypertension, and CAD studies. Significant negative genetic correlation estimates endorse the inverse relationship between Df and CAD, and between Df and myocardial infarction (MI), one of CAD's fatal outcomes. This strong association motivated us to develop an MI predictive model combining clinical information, Df, and a CAD polygenic risk score, using a random forest algorithm. Internal cross validation evidenced a considerable improvement in the area under the curve (AUC) of our predictive model (AUC=0.770) compared with an established risk model, SCORE (AUC=0.719). Our findings shed new light on the genetic basis of Df, unveiling a common control with CAD, and highlight the benefits of its application in individualised MI risk prediction.

9
Identification of Blood miRNA Biomarkers in Systemic Tuberculosis through Metadata Analysis

CHADALAWADA, S.; Rathinam, S.; Devarajan, B.

2025-08-14 health informatics 10.1101/2025.08.12.25333471
Top 0.1%
89× avg

Individual studies of miRNA analysis from small-RNA sequencing data can produce contradictory results. Metadata analysis can be used to overcome inconsistent findings between different studies. Thus, we aimed to identify dysregulated miRNAs in systemic tuberculosis (TB) patients through metadata analysis using small-RNA sequencing data. A total of 131 samples from seven different datasets were downloaded from the Sequence Read Archive (SRA) database. Among them, 45 were from healthy controls, 47 from patients with active TB, and 39 from patients with latent TB (LTB). First, we identified differentially expressed (DE) miRNAs in active TB and LTB samples compared to controls. A total of 52 miRNAs were filtered based on their role in TB-specific, active TB, LTB-specific, and disease progression. These miRNAs may play an important role in TB disease progression from LTB to active TB. Subsequently, we performed gene enrichment and network analysis for both upregulated and downregulated miRNAs. From that, we selected eight miRNAs (hsa-miR-155-5p, hsa-miR-223-3p, hsa-miR-32-3p, hsa-miR-374a-3p, hsa-miR-374a-5p, hsa-miR-582-5p, hsa-miR-320d, and miR-122-5p) as candidate biomarkers based on their role in TB pathogenesis through the PI3K-Akt signaling pathway, TNF signaling pathway, phagosome, and NOD-like signaling pathway. For the first time, we performed a metadata analysis with small-RNA sequencing data from publicly available datasets and identified miRNAs that could serve as biomarkers for systemic TB, which require further experimental confirmation.

10
Machine learning approach to assess the pathogenicity of BRCA1/2 genetic variants : brca-NOVUS

Vatsyayan, A.; Scaria, V.

2023-10-20 health informatics 10.1101/2023.10.20.23297295
Top 0.1%
84× avg

Breast cancer is globally the leading type of cancer in terms of both incidence and mortality. BRCA1 and BRCA2 gene variants have long been linked to and studied in the context of the disease. Rapid variant discovery has further been made freely accessible by advances in next-generation sequencing, making it a demanding task to accurately interpret these variants for clinical and research applications. To establish the nature of these variants, the American College of Medical Genetics and Genomics and the Association for Molecular Pathology (ACMG-AMP) have issued a set of guidelines for variant classification. However, given the huge number of variants associated with the two large and well-studied genes, functional studies or ACMG-AMP classification is a mountainous challenge. Here we describe brca-NOVUS, a machine learning approach trained on a gold-standard ACMG-qualified dataset for the accurate interpretation of variants at large scale. Using two independent test and validation datasets of ACMG-qualified variants, we show that brca-NOVUS can be used for the classification of variants in clinical as well as research settings.

11
Phenome-wide causal proteomics enhance systemic lupus erythematosus flare prediction: A study in Asian populations

Chen, L.; DENG, O.; Fang, T.; Chen, M.; Zhang, X.; Cong, R.; Lu, D.; Zhang, R.; Jin, Q.; Wang, X.

2024-11-18 health informatics 10.1101/2024.11.17.24317460
Top 0.1%
59× avg

Objective: Systemic lupus erythematosus (SLE) is a complex autoimmune disease characterized by unpredictable flares. This study aimed to develop a novel proteomics-based risk prediction model specifically for Asian SLE populations to enhance personalized disease management and early intervention. Methods: A longitudinal cohort study was conducted over 48 weeks, including 139 SLE patients monitored every 12 weeks. Patients were classified into flare (n = 53) and non-flare (n = 86) groups. Baseline plasma samples underwent data-independent acquisition (DIA) proteomics analysis, and phenome-wide Mendelian randomization (PheWAS) was performed to evaluate causal relationships between proteins and clinical predictors. Logistic regression (LR) and random forest (RF) models were used to integrate proteomic and clinical data for flare risk prediction. Results: Five proteins (SAA1, B4GALT5, GIT2, NAA15, and RPIA) were significantly associated with SLE Disease Activity Index-2K (SLEDAI-2K) scores and 1-year flare risk, implicating key pathways such as B-cell receptor signaling and platelet degranulation. SAA1 demonstrated causal effects on flare-related clinical markers, including hemoglobin and red blood cell counts. A combined model integrating clinical and proteomic data achieved the highest predictive accuracy (AUC = 0.769), surpassing individual models. SAA1 was highlighted as a priority biomarker for rapid flare discrimination. Conclusion: The integration of proteomic and clinical data significantly improves flare prediction in Asian SLE patients. The identification of key proteins and their causal relationships with flare-related clinical markers provides valuable insights for proactive SLE management and personalized therapeutic approaches.

12
Investigating the likely association between genetic ancestry and COVID-19 manifestation

Das, R.; Ghate, S. D.

2020-04-07 health informatics 10.1101/2020.04.05.20054627
Top 0.1%
59× avg

Background: The novel coronavirus, severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), has spread rapidly throughout the world, leading to catastrophic consequences. However, SARS-CoV-2 infection has shown discernible variability across the globe. While in some countries people are recovering relatively quickly, in others recovery times have been comparatively longer and the number of individuals succumbing to it is high. This variability in coronavirus disease 2019 (COVID-19) susceptibility is suggestive of a likely association between the genetic makeup of affected individuals, modulated by their ancestry, and the severity of COVID-19 manifestations. Objective: In this study, we aimed to evaluate the potential association between an individual's genetic ancestry and the extent of COVID-19 disease presentation, employing Europeans as the case study. In addition, using a genome-wide association study (GWAS) approach, we sought to discern the putative single nucleotide polymorphism (SNP) markers and genes that may be associated with differential COVID-19 manifestations by comparative analyses of the European and East Asian genomes. Method: To this end, we employed 10,215 ancient and modern genomes across the globe, assessing 597,573 SNPs obtained from the databank of Dr. David Reich, Harvard Medical School, USA, to evaluate the likely correlation between European ancestry and COVID-19 manifestations. Ancestry proportions were determined using the qpAdm program implemented in AdmixTools v5.1. Pearson's correlation coefficient (r) between various ancestry proportions of European genomes and COVID-19 death/recovery ratio was calculated and its significance was statistically evaluated. GWAS was performed in PLINK v1.9 to investigate SNPs with significant allele frequency variations among European and East Asian genomes that likely correlated with differential COVID-19 infectivity. 
Results: We found a significant positive correlation (r=0.58, P=0.03) between West European hunter gatherers (WHG) ancestral fractions and COVID-19 death/recovery ratio for data as of 5th April 2020. This association discernibly amplified (r=0.77, P=0.009) upon reanalyses based on data as of 30th June 2020, removing countries with small sample sizes and adding those that are a bridge between Europe and Asia. Using GWAS we further identified 404 immune response related SNPs by comparing publicly available 753 genomes from various European countries against 838 genomes from various Eastern Asian countries. Prominently, we identified that SNPs associated with immune-system related pathways such as interferon stimulated antiviral response, adaptive and innate immune system and IL-6 dependent immune responses show significant differences in allele frequencies (chi-square values ≥1500; P≈0) between Europeans and East Asians. Conclusion: So far, to the best of our knowledge, this is the first study investigating the likely association between host genetic ancestry and COVID-19 severity. These findings improve our overall understanding of the putative genetic modifiers of COVID-19 clinical presentation. We note that the development of effective therapeutics will benefit immensely from more detailed analyses of individual genomic sequence data from COVID-19 patients of varied ancestries.

13
DNA-Based Deep Learning and Association Studies for Drug Response Prediction in Leiomyosarcoma

Ebrahimi, A.

2025-10-24 health informatics 10.1101/2025.10.22.25338578
Top 0.1%
58× avg

Leiomyosarcoma (LMS) is a rare and aggressive soft tissue sarcoma with limited treatment options and poor prognosis. Standard therapies, including doxorubicin, gemcitabine, trabectedin, and pazopanib, demonstrate variable efficacy across patients, underscoring the need for predictive biomarkers and computational models to inform personalized therapy. We developed a deep learning framework using DNA mutation and expression data to predict multi-task binary drug responses in LMS. Feedforward neural networks (FNNs) and transformer-based models were trained with binary cross-entropy (BCE) and weighted BCE (WBCE) loss functions to address class imbalance. In addition to predictive modeling, we conducted statistical association studies to identify links between genomic alterations and drug sensitivity, and performed Kaplan-Meier survival analyses to assess the prognostic relevance. Transformer models outperformed FNN baselines, achieving an overall F1-score of 0.87. Association studies revealed biologically meaningful links: TP53 mutations correlated with doxorubicin resistance, RB1 deletions with gemcitabine non-response, ATRX mutations with poor pazopanib outcomes, and MDM2 amplification with trabectedin resistance. This study demonstrates the utility of DNA-driven deep learning combined with association studies for predicting drug responses in LMS. Our framework not only provides multi-task binary predictions but also yields biologically interpretable associations for the targeted DNAs, highlighting key genomic drivers of therapy resistance. These findings support the development of precision oncology strategies for this rare and challenging cancer.
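The weighted binary cross-entropy (WBCE) loss mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not say how the weights were defined, so the single positive-class weight used here (`pos_weight`) is an assumption, chosen because it is a common way to handle class imbalance; the function name is likewise hypothetical.

```python
import math


def weighted_bce(y_true, y_prob, pos_weight=1.0, eps=1e-12):
    """Mean binary cross-entropy with the positive-class term scaled by pos_weight.

    With pos_weight=1.0 this reduces to ordinary BCE. The weighting scheme is
    an illustrative assumption, not the formulation used in the preprint.
    """
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


# One responder and one non-responder, both predicted at probability 0.5.
labels = [1, 0]
probs = [0.5, 0.5]
plain = weighted_bce(labels, probs)                     # ordinary BCE
weighted = weighted_bce(labels, probs, pos_weight=3.0)  # responder errors cost 3x
print(plain, weighted)
```

Raising `pos_weight` increases the penalty for misclassifying the rarer positive class, which pushes a model away from always predicting the majority label.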

14
Predicting ovarian/breast cancer pathogenic risks of BRCA1 gene variants of unknown significance

Lin, H.-H.; Xu, H.; Hu, H.; Ma, Z.; Zhou, J.; Liang, Q.

2020-06-05 health informatics 10.1101/2020.06.04.20120055
Top 0.1%
58× avg

The difficulty of early diagnosis is an important cause of the high mortality rates of ovarian cancer patients. Instead of symptom-based diagnostic methods, modern sequencing technologies enable access to human genetic information by reading the nucleotide base sequences of DNA/RNA molecules. In this way, gene mutations and variants can be identified, and hence a better clinical diagnosis at the molecular level can be expected. However, as sequencing technologies gain more popularity, novel gene variants with unknown clinical significance are found, complicating the interpretation of patients' genetic data, precise disease diagnosis, and the making of therapeutic strategies and decisions. In order to solve these issues, it is of critical importance to find ways to analyze and interpret such variants. In this work, BRCA1 gene variants with unknown clinical significance were identified from clinical sequencing data, and we then developed machine learning models to predict the pathogenicity of variants with unknown clinical significance. In performance benchmarking, our optimized random forest model scored 0.85 in area under the receiver-operating characteristic curve, which outperformed other models. Finally, we applied the optimized random forest model to predict the pathogenic risks of 7 BRCA1 variants of unknown clinical significance identified from our sequencing data, and 6315 variants of unknown clinical significance in the ClinVar database. As a result, our model predicted 4724 benign and 1591 pathogenic variants, which helped the interpretation of these variants of unknown significance and diagnosis.

15
Can the protection be among us? Previous viral contacts and prevalent HLA alleles could be avoiding an even more disseminated COVID-19 pandemic.

Antonio, E. C.; Meireles, M. R.; Bragatte, M. A. S.; Vieira, G. F.

2020-06-17 health informatics 10.1101/2020.06.15.20131987
Top 0.1%
57× avg

COVID-19 is bringing scenes from sci-fi movies into real life, and it seems to be far from over. Infected individuals exhibit variable severity, suggesting that the genetic constitution of populations and previous cross-reactive immune contacts are involved in an individual's disease outcome. To investigate the participation of MHC alleles in COVID-19 severity, we used HLA-B*07, HLA-B*44, HLA-DRB1*03, and HLA-DRB1*04 in combination, which grouped affected countries presenting similar death rates based only on their allele frequencies. To prospect T cell targets in SARS-CoV-2, we modeled 3D structures of HLA-A*02:01 complexed with immunogenic epitopes from SARS-CoV-1 and compared them with models containing the corresponding SARS-CoV-2 peptides. This revealed molecular conservation between SARS-CoV peptides, indicating that the corresponding current sequences are putative T cell epitopes. These structures were also compared with sequences from other HCoVs and with a panel of epitopes from unrelated viruses, looking for the triggers of cross-protection in asymptomatic and uninfected individuals. 229E, OC43, and, impressively, viruses involved in endemic human infections share fingerprints of immunogenicity with SARS-CoV peptides. Wide-scale HLA genotyping in COVID-19 patients should improve prognosis prediction. Structural identification of previous triggers paves the way for herd immunity examination and wide-spectrum vaccine development.

16
Blood lipid density decreases at the early-stage of stomach adenocarcinoma

Prakash, O.; Khan, F.

2023-03-02 health informatics 10.1101/2023.03.01.23286666
Top 0.1%
57× avg

Background: The induction of cancer causes many changes in the human body, from the molecular to the physiological level. Observing these variations between normal and diseased conditions forms the basis of disease diagnosis. The present study analyzed the blood lipid profile, relative to changes in genotype, at the early stage of stomach adenocarcinoma. Materials and Methods: The present study was based on establishing the relationship between genotype and phenotype. Genotypic features were collected through RNA-seq analysis and then mapped to phenotypic expression in the form of the blood lipid profile. Results: To assess whether phenotypic expression differed significantly between normal and cancerous conditions, gene signatures from multiple source studies were mapped to the blood lipid profile, including total cholesterol, LDL cholesterol, HDL cholesterol, triglycerides, non-HDL-C, and the TG-to-HDL ratio. A significant difference was found between the phenotypic expression of normal and cancerous conditions. Conclusion: Through multi-signature-based population observation, it was found that blood lipid density decreases at the early stage of stomach adenocarcinoma. Further, the blood lipid profile can be used for early disease prediction of stomach adenocarcinoma as well as other cancer types.

17
Multi-omics integration predicts 17 disease incidences in the UK Biobank

Du, J.; Zhou, M.; Raffield, L. M.; Zhou, R.; Li, Y.; Chen, C.; Sun, Q.

2025-08-05 health informatics 10.1101/2025.08.01.25332841
Top 0.1%
56× avg

Importance: Traditional clinical predictors of disease risk have limitations in capturing underlying disease complexity. Multi-omics technologies, such as metabolomics and proteomics, offer deeper molecular perspectives that could enhance risk prediction, but large-scale studies integrating the two omics are scarce. Objectives: The primary objective is to systematically evaluate whether adding metabolomics and/or proteomics data to traditional clinical predictors improves risk prediction for 17 common incident diseases. A secondary objective is to identify key disease-related omics features. Data Sources and Participants: Our study incorporated 23,776 UK Biobank participants who had complete baseline omics data for 159 NMR-based metabolites and 2,923 Olink affinity-based proteins. Main Outcomes and Measures: We evaluated model prediction of 17 incident diseases by fitting Cox proportional hazards models and obtaining Harrell's C-index. Feature importance scores were calculated to identify key molecules contributing to each disease risk prediction. Results: Adding omics data significantly improved risk prediction for all 17 diseases compared to models with clinical predictors alone (p-value < 2E-4). Proteomics-only models generally demonstrated superior predictive performance over metabolomics-only models, doing so for 14 of the 17 endpoints. We also identified key proteins, including established biomarkers like KLK3 (PSA) for prostate cancer and CRYBB2 for cataracts. Conclusion and Relevance: Integration of Olink proteomics, and to a lesser extent Nightingale metabolomics, substantially improves risk prediction for a wide range of common diseases beyond established clinical factors. These findings highlight the clinical utility of proteomics for enhancing individual risk prediction and provide molecular insights into disease mechanisms, which may potentially guide future therapeutic development.
Key Points. Question: Do multi-omics profiles improve disease risk prediction compared to models using only traditional clinical risk factors, and what is the best strategy to integrate metabolomics and proteomics in disease prediction? Findings: In this study, we investigated 17 incident diseases across 23,776 UK Biobank individuals with complete records of both Nightingale metabolomics and Olink proteomics profiles, and found that integrating omics data significantly enhanced disease prediction over traditional approaches, with Olink proteomics consistently providing more predictive power than Nightingale metabolomics for most diseases. We also identified key proteins, including both well-established ones like KLK3 (PSA) for prostate cancer and potential novel ones like PRG3 for skin cancer. We also connected diseases with medication, socioeconomic, demographic, and lifestyle risk factors through these key proteins. Meaning: Our findings suggest the potential clinical utility of integrating multi-omics in risk prediction and biomedical discoveries. To the best of our knowledge, our study is currently the largest to systematically evaluate contributions of both metabolomics and proteomics profiles to the prediction of various incident clinical endpoints.
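
Harrell's C-index, the evaluation metric named in this abstract, measures how often a model ranks the participant who develops disease earlier as higher risk. The sketch below is a simplified version under stated assumptions: a pair is comparable only when the participant with the shorter follow-up time had an event, and pairs with tied times are skipped. The function name and toy values are illustrative, not the study's pipeline.

```python
def harrell_c_index(times, events, risks):
    """Simplified Harrell's concordance index for right-censored data.

    times:  follow-up time per participant.
    events: 1 if the disease was observed, 0 if the participant was censored.
    risks:  predicted risk score, higher meaning earlier expected event.
    Assumption: pairs with tied follow-up times are ignored.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored participant cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Toy data: higher predicted risk always goes with the earlier event.
c = harrell_c_index([1, 2, 3], [1, 1, 0], [3.0, 2.0, 1.0])  # 1.0
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is why improvements from adding omics features are usually reported as differences in this index between nested models.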

18
Classification of colon cancer patients into Consensus Molecular Subtypes using Support Vector Machines

Kochan, N.; Dayanc, B. E.

2023-05-23 health informatics 10.1101/2023.05.22.23290335
Top 0.1%
56× avg

Objective: The molecular heterogeneity of colon cancer has made the classification of tumors a requirement for effective treatment. One approach to the molecular subtyping of colon cancer patients is the Consensus Molecular Subtypes (CMS) developed by the Colorectal Cancer Subtyping Consortium (CRCSC). CMS-specific, RNA-seq-dependent classification approaches are recent and have relatively low sensitivity and specificity. In this study, we aimed to classify patients into CMS groups using RNA-seq profiles. Methods: We first identified subtype-specific and survival-associated genes using the Fuzzy C-Means (FCM) algorithm and the log-rank test. We then classified patients using Support Vector Machines with a Backward Elimination methodology. Results: We optimized RNA-seq-based classification using 25 genes with the minimum classification error rate. Here we report the classification performance using precision, sensitivity, specificity, false discovery rate, and balanced accuracy metrics. Conclusion: We present the gene list for colon cancer classification with minimum classification error rates. We observed the lowest sensitivity but the highest specificity with CMS3-associated genes, which is significant given the low number of patients in the clinic for this group.
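
The five metrics listed in this abstract can all be derived from a one-vs-rest confusion matrix for a single subtype. The sketch below shows the standard definitions; the function name and the counts in the example are illustrative and are not results from the preprint.

```python
def one_vs_rest_metrics(tp, fp, fn, tn):
    """Standard binary metrics from confusion-matrix counts for one subtype.

    tp/fn: patients of the subtype classified correctly/incorrectly.
    fp/tn: patients of other subtypes assigned to/kept out of the subtype.
    """
    if tp + fn == 0 or tn + fp == 0 or tp + fp == 0:
        raise ValueError("metric undefined: a required denominator is zero")
    sensitivity = tp / (tp + fn)   # recall for the subtype
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    return {
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "false_discovery_rate": 1.0 - precision,
        "balanced_accuracy": (sensitivity + specificity) / 2.0,
    }

# Hypothetical counts for one subtype against all others.
metrics = one_vs_rest_metrics(tp=8, fp=1, fn=2, tn=9)
```

Balanced accuracy is the relevant summary for a small group such as CMS3, because plain accuracy can look high even when few patients of the rare subtype are recovered.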

19
Investigating The Causal Link Between Serum Iron Status and Pernicious Anaemia Risk: A Mendelian Randomisation Study

Comesana Cimadevila, G.; Thain, A.; Dib, M.-J.; Ahmadi, K. R.

2024-10-29 health informatics 10.1101/2024.10.28.24316258
Top 0.1%
56× avg

Introduction: Pernicious anaemia (PA) is characterised by vitamin B12 deficiency due to autoimmune-mediated destruction of gastric parietal cells and the consequent loss of intrinsic factor. A considerable proportion of PA patients also exhibit iron deficiency (ID), both before or at (20.7-52%) and after (46.4%) PA diagnosis. However, findings from observational studies do not clarify whether ID contributes to PA risk or is a consequence of the PA disease process. Given the high prevalence of ID at PA diagnosis, we hypothesised that reduced iron status may play a causal role in PA risk. Methodology: We conducted two-sample Mendelian Randomisation (MR) analyses to evaluate the causal effect of systemic iron status on PA risk. Genetic association data for iron status were sourced from the deCODE study. Associations of the chosen SNPs with PA were obtained from genome-wide association study (GWAS) summary statistics, primarily from the FinnGen R10 release and, for replication purposes, from the GWAS conducted by Laisk et al. (2021). Participant data consisted of 3,694 cases of PA and 393,684 controls. Inverse-variance weighted analysis was the primary MR method, with sensitivity analyses including MR-Egger and weighted-median estimates, in addition to tests for horizontal pleiotropy and heterogeneity. Results: Four SNPs were strongly associated with systemic iron status and were used as genetic instruments. We found that genetically predicted iron status was not significantly associated with PA risk (odds ratio per 1 standard deviation increase in serum iron: 1.13, 95% confidence interval 0.80-1.58, P=0.49). Sensitivity analyses gave consistent results, indicating that MR assumptions were not violated and highlighting a null result subjectivity due to the presence of horizontal pleiotropy and heterogeneity. Conclusion: This is the first MR study investigating the potential causal relationship between iron status and PA risk. Our results show that genetically predicted iron status is not associated with a significantly increased PA risk among individuals of European ancestry. Further research is needed to understand the manifestation of ID in PA.
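
The inverse-variance weighted (IVW) method named as the primary analysis combines per-SNP effects into one causal estimate. The sketch below shows the standard fixed-effect form, assuming uncorrelated instruments and negligible error in the SNP-exposure estimates. The function name and the input numbers are hypothetical and are not the study's summary statistics.

```python
import math

def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect inverse-variance weighted Mendelian randomisation estimate.

    beta_exposure: SNP effects on the exposure (e.g. serum iron, per SD).
    beta_outcome:  SNP effects on the outcome (log odds of disease).
    se_outcome:    standard errors of the SNP-outcome effects.
    Returns (log odds ratio per unit of exposure, its standard error).
    """
    if not (len(beta_exposure) == len(beta_outcome) == len(se_outcome)):
        raise ValueError("all inputs must have the same length")
    numerator = 0.0
    denominator = 0.0
    for bx, by, se in zip(beta_exposure, beta_outcome, se_outcome):
        weight = 1.0 / (se * se)
        numerator += bx * by * weight
        denominator += bx * bx * weight
    if denominator == 0.0:
        raise ValueError("instruments have no effect on the exposure")
    return numerator / denominator, math.sqrt(1.0 / denominator)

# Hypothetical summary statistics for two instruments.
log_or, se = ivw_estimate([0.1, 0.2], [0.01, 0.02], [0.01, 0.01])
odds_ratio = math.exp(log_or)
ci_low = math.exp(log_or - 1.96 * se)
ci_high = math.exp(log_or + 1.96 * se)
```

Exponentiating the estimate and its interval gives an odds ratio per unit increase in the exposure, which is the scale on which the abstract reports its result. With only a handful of instruments, heterogeneity tests and pleiotropy-robust estimators have limited power, so the sensitivity analyses mentioned in the abstract should be read with that in mind.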

20
Diagnostic and prognostic value of the gasdermins in gastric cancer

Xu, Y.; Wan, C.; Wang, P.; Gu, Y.

2023-10-24 genetic and genomic medicine 10.1101/2023.10.18.23297225
Top 0.1%
55× avg

Background: Pyroptosis has drawn attention owing to its contributory role in various cancers. Recently, gasdermins (GSDMs), which participate in pyroptosis, have been reported to be associated with multiple types of cancer. However, the role of GSDM expression in the diagnosis and prognosis of gastric cancer (GC) has not been well elucidated. Moreover, the mechanisms underlying the carcinogenesis of GC are still obscure. Methods: We analyzed the transcriptional and prognostic information and the role of GSDMs in patients with GC using the TIMER, UALCAN, Human Protein Atlas (HPA), GEPIA, and Kaplan-Meier plotter databases. The cBioPortal online tool was used to analyze GSDM alterations, correlations, and networks. Furthermore, STRING, Cytoscape, and TIMER were used to explore functional enrichment and immune modulation. The statistical analysis was carried out in the R environment, and a P-value < 0.05 was considered statistically significant. Results: GSDMB, GSDMC, GSDMD, and GSDME showed higher expression in GC than in normal tissue in the TIMER database. Moreover, survival analyses in two databases both demonstrated that high expression of GSDME was related to shorter overall survival (OS) in patients with GC. Additionally, functional enrichment revealed that GSDMs might be involved in endopeptidase activity, peptidase regulator activity, and cysteine-type peptidase activity. GSDMs were also correlated with infiltration levels of immune cells in GC, and GSDME was correlated with the infiltration levels of CD4+ T cells, CD8+ T cells, neutrophils, macrophages, and dendritic cells. Conclusions: The study systematically indicated the potential diagnostic and prognostic value of GSDMs in GC. Our results showed that GSDME might play a considerable oncogenic role in GC diagnosis and prognosis. However, our bioinformatics analyses should be further validated in more prospective studies.